Introduction

Tranposon insertion sequencing (Tn-seq) is a robust and sentive method to tackle the discovery of quantitative genetic interactions in with transposons microorganisms.This techniques combines massive parallel sequencing with transposon insertional mutagenesis. Transposons are DNA sequences that change their postion within the genome. In orer to carry out Tn-seq, a transposon insertion library containing a group of mutants that collectively have transposon insertions in all non-essential genes is created. The library is then grown under an experimental condition of interest. Mutants with transposons inserted in genes required for growth under the test condition will diminish in frequency from the population. Genomic sequences adjacent to the transposon ends are amplified by PCR and sequenced by massively parallel sequencing to determine the location and abundance of each insertion mutation. The importance of each gene for growth under the test condition is determined by comparing the abundance of each mutant before and after growth under the condition being examined. If an inserted mutation has negative effect on bacterial growth then a lower number of mapping reads will be found for that gene compared to the non-mutatated gene, the opposite occurs for insertations that positevily affect bacterial growth.

This is project is heavily based on the paper Tn-seq: high-throughput parallel sequencing for fitness and genetic interaction studies in microorganisms by van Opijnen, Bodi, and Camilli. They focused their work on a S. pneumoniae culture was transformed with the magellan6 transposon. In this paper they calculated a fitness score for each gene to determine the effect of the inserted mutation. This score is computed by first counting the number of mapping reads at each transposon insertion location for two different time points and then by by comparing the fold expansion of the mutant relative to the rest of the population with the following equation. The equation used is shown below.

Fitness Score

Fitness Score

Given the fitness score for each insertion, average fitness score per gene was then calculated by By averaging over all insertions in a specific gene. Based on this average score, genes were divided into four categories:

  • Neutral (Wi =0.96−1.04)
  • Advantageous (Wi > 1.04)
  • Disadvantageous (Wi < 0.96)
  • Possibly essential (Wi = 0)

Dependencies

The plots below display the number of genes belonging to each catecory as well as the the fitness value for each gene against the gene coordinate defined as the middle point between the start and the end of each gene.

fData_old <- read.delim("Tn_seq_fitness_data_Opijnen_et_al_2009.txt",
                        header=TRUE, stringsAsFactors = FALSE, sep = "\t")
# locus (gene locus) average_fitness    sd   sem N_insertions

# gene coordinates, gene locus + coordinates
geneCoord <- read.delim("GCF_000006885.1_ASM688v1_genomic_olt.txt",
                        header=FALSE, stringsAsFactors = FALSE, sep = "\t")

matches <- match(fData_old$locus,geneCoord$V1)
# 1813 matches in total

fData <- fData_old[!is.na(matches),]
essential <- length(fData$locus[fData$average_fitness==0])
# genes with fitness equal to 0 are considered essential genes
# transposons insertions occur only for non-essential genes

matches <- matches[!is.na(matches)]
geneCoord <- geneCoord[matches,]
genome_size <- max(geneCoord$V3)


fitness_df <- data.frame(x=(geneCoord$V2[fData$average_fitness!=0] +
                              geneCoord$V3[fData$average_fitness!=0])/2,
                         y=fData$average_fitness[fData$average_fitness!=0])
colnames(fitness_df) <- c('gene_coordinates', 'avg_fit_per_gene')

# according to the thresholds used in the articles,
# each gene is assigned to a insertion_type class depending on its avg fitness value
fitness_df$insertion_type <- cut(fitness_df$avg_fit_per_gene,
                                 c(0,0.96,1.04,Inf),
                                 c('Disadvantage','Neutral','Advantage'))

insertion_type <- as.data.frame(table(fitness_df$insertion_type))
levels(insertion_type$Var1) <- c(levels(insertion_type$Var1), 'essential')
insertion_type <- rbind(insertion_type,list('essential',essential))
colnames(insertion_type) <- c('group','value')
# pie chart for insertion types
 ggplot(insertion_type, aes(x="",y=value,fill=group)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) + theme_void() +
  geom_text(aes(label=value), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(name='Insertion Type', 
                    labels=c('Disadvantage','Neutral','Advantage','Essential'), 
                   values =c("#CC6666", "#9999CC", "#66CC99",'#FF9966'))

By looking at the second plot, it can be noticed that the fitness curve as a sort of ‘smile’ shape depicting a trend where genes lying close to the origin of replication shows higher fitness score than genes farther away. This is somehow expected due to the circular shape of bacterial genome. In fact, genome regions proximal to the origin of replication are present on average in a higher number of copies than distal ones. In this project we tried to tackle this problem by finding an appropriate pipeline to correct the average fitness scores.

Radians

The gene coordinates provided are linearized, though bacterial genome is circular therefore it should be preferred to work with circular coordinates instead of linear ones. For this matter, we transformed linear gene coordinates into radians. Moreover, genes at the two extremities have the same distance to the ORI, therefore radians coordinates have been rescaled so all coordinates range between 0 and \(\pi\) regardless of their location on the genome. In this way, instead of correcting the bias using one half of the genome, we can correct the bias by using the two halfs of the genome.

Linear Trend

We tried to remove the linear trend to see if this was enough to properly account for the bias introduced by the circularity of the genome. We first used a train-test approach. We randomly sampled half of the the genes and used those for training the linear model, and then tested the model on the remaining test genes. More precisely, we fitted the linear model on training data, used the model to predict the average fitness for the test genes, and then corrected the observed test gene fitness scores by removing the predicted scores and adding back the average test fitness score.


Call:
lm(formula = avg_fit_per_gene ~ scaled_radians, data = train_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.282911 -0.011063  0.006009  0.018954  0.243152 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.030184   0.003258 316.192   <2e-16 ***
scaled_radians -0.015776   0.001764  -8.943   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.04606 on 784 degrees of freedom
Multiple R-squared:  0.09256,   Adjusted R-squared:  0.0914 
F-statistic: 79.97 on 1 and 784 DF,  p-value: < 2.2e-16


Call:
lm(formula = fitted_fitness ~ gene_coordinates, data = test_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48876 -0.00928  0.01046  0.02400  0.23613 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      9.938e-01  4.160e-03 238.877   <2e-16 ***
gene_coordinates 1.803e-09  3.290e-09   0.548    0.584    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05704 on 783 degrees of freedom
Multiple R-squared:  0.0003835, Adjusted R-squared:  -0.0008932 
F-statistic: 0.3004 on 1 and 783 DF,  p-value: 0.5838

The fitness score displays a constant trend indicating that the bias was accounted for. We further tested this by fitting a linear model with the correct fitness score as the response variable and gene coordinates as the explanatory variable. The p-value for slope term is quite large ( > 0.05), indicating there is no significant slope.

All Genes

Given this result, we applied the same correction procedure, using the entire dataset to fit the linear model this time.


Call:
lm(formula = avg_fit_per_gene ~ scaled_radians, data = fitness_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.49342 -0.00952  0.00841  0.02161  0.24572 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.028171   0.002626  391.54   <2e-16 ***
scaled_radians -0.016010   0.001436  -11.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05187 on 1569 degrees of freedom
Multiple R-squared:  0.07345,   Adjusted R-squared:  0.07286 
F-statistic: 124.4 on 1 and 1569 DF,  p-value: < 2.2e-16

Window

We decided to use this same approach, though we considered windows of nucleotides and used the average point in the window (average radian for each window) as X in the linear model instead of all single genes, and the average fitness score per window as Y, where the average fitness value for the window is the average of all gene falling within the window). We decided to also try this method as a model fitted using all genes may be more sensitive to noise in the data. With this more robust approach we wanted to see if we could improve further the results. We tested different window sizes (10’000, 50’000, 100’000, 200’000).

result_window_size_df <- data.frame("Insertion Type"=rep(c('Disadvantage', 'Neutral','Advantage'),4),
                                    'Count'= rep(0,12), 
                                    'Correction'=rep(c('window10k', 'window50k','window100k', 'window200k'),
                                                     each=3))
result_window_size_df$Insertion.Type <- factor(c('Disadvantage', 'Neutral','Advantage'),levels = c('Disadvantage', 'Neutral','Advantage'))

#result_window_size_df$Correction<- factor(c('window10k', 'window50k','window100k', 'window200k'),labels = c('window10k', 'window50k','window100k', 'window200k'))

window_sizes <- c(10000, 50000, 100000, 200000)

for (window_size in window_sizes) {
  windows <- seq(0, genome_size, by=window_size)
  windows <- c(windows, genome_size)
  
  window <- cut(fitness_df$gene_coordinates, windows)
  fitness_df$window <- window
  head(fitness_df)
  
  # df per window
  avg_fit_per_window <- with(
    fitness_df, by(fitness_df$avg_fit_per_gene, fitness_df$window, mean))
  avg_fit_per_window <- sapply(avg_fit_per_window, mean)
  mid_radian <- sapply(with(
    fitness_df, by(fitness_df$scaled_radians, fitness_df$window, mean)),
    mean)
  fitness_window <- data.frame(window =levels(window),
                               avg_fitness=avg_fit_per_window)
  fitness_window$mid_radian <- mid_radian
  head(fitness_window)
  fitness_window$gene_number <- table(fitness_df$window)
  
  model_window <- lm(avg_fitness ~ mid_radian, fitness_window)
  summary(model_window)
  
  new_window_avg <- predict(model_window, fitness_window)
  fitness_df$new_pred_window <- fitness_df$avg_fit_per_gene -
    new_window_avg[fitness_df$window] + mean(fitness_df$avg_fit_per_gene)
  
  fitness_df$new_int_type_w <- cut(fitness_df$new_pred_window,
                                   c(0,0.96,1.04,Inf),
                                   c('Disadvantage','Neutral','Advantage'))
  
  
  table_count <- table(fitness_df$new_int_type)
  index <- match(window_size, window_sizes)
  shift <- (index - 1)*3
  result_window_size_df[(shift+1):(shift+3),'Count'] = table_count
  
} 

ggplot(result_window_size_df, aes(x=Insertion.Type, y=Count, fill=Correction)) +
  geom_bar(stat="identity", position = "dodge") +
  scale_fill_manual(name= 'Correction',
                    values = c("#CC6666", "#9999CC","#66CC99", "#FF9966"),
                    labels=c('Window 10K', 'Window 50k', 'Window 100k',
                             "Window 200k")) +
  labs(x = "Insertion Type", y = "Counts") 

Summary table
Disadvantage Neutral Advantage
window100k 170 1235 166
window10k 170 1235 166
window200k 170 1235 166
window50k 170 1235 166

All different window sizes returned very similar results. We chose 100,000 as the optimal window size as it was a good trade-off between number of windows to fit and number of nucleotides per window.

Window 100k


Call:
lm(formula = avg_fitness ~ mid_radian, data = fitness_window)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.012666 -0.006892  0.001237  0.005757  0.012309 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.028285   0.003203 321.040  < 2e-16 ***
mid_radian  -0.016119   0.001779  -9.059 1.62e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.007649 on 20 degrees of freedom
Multiple R-squared:  0.8041,    Adjusted R-squared:  0.7943 
F-statistic: 82.07 on 1 and 20 DF,  p-value: 1.622e-08

Gene Group

So far we have considered windows having the same length, though has the bar plot below shows, each window has a different number of genes. Therefore this time, instead of using windows of fixed length, we created windows of variable length but having fixed number of genes per window.

[1] 71.40909

We decided that we wanted to have as many windows as the number of windows we had for the window approach presented earlier. Therefore we cretated 22 windows each spanning either 71 or 72 genes. Then we repated the same steps as in the previous cases.


Call:
lm(formula = avg_fitness ~ `mid radian`, data = fitness_group)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0152788 -0.0045553  0.0008201  0.0043868  0.0113487 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.028348   0.003405 302.028  < 2e-16 ***
`mid radian` -0.016126   0.001861  -8.664 3.33e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.007931 on 20 degrees of freedom
Multiple R-squared:  0.7896,    Adjusted R-squared:  0.7791 
F-statistic: 75.07 on 1 and 20 DF,  p-value: 3.327e-08

Finding ORI e TER

Until now, we have used the fact that for this genome we know the real ORI and the orientation of the genome. Now we suppose that we do not know the ORI and so we have to find it by exploiting the fact that the bias seems to be larger for genes close to the ORI.

In order to find the ORI and the TER, we will fit a local linear regression (LOESS), which is a method that allow us to fit models with any shape by fitting a linear regression for each subset in which we divide the dataset. After fitting this model we will find the maximum of this curve, which represent the ORI, and the minimun, which represent the TER.

Fitting LOESS

Since we want to use LOESS (local linear regression) and this method do not allow us to be fitted for the extremity values of genomic coordinates, we use the fact that the genome is circular and extend the extremities.

Now we fit the LOESS. We use different span values, in order to understand if changing this parameter will affect the results in a significant way. The span parameter tell us how many neighbors we have to use for the local piece of the linear regression. For each fitted model then we find the maximum value and finally we check the difference between the results.

span_values <- c(0.25, 0.5, 0.75, 1.1)
oris <- c()

for (i in span_values) {
  # fit the LOESS
  model_loess <- loess(avg_fit_per_gene ~ new_gene_coordinates, data= new_data, 
                     span = i,degree = 2)
  summary(model_loess)

  # find max, which represent the ORI
  fitness_pred <-predict(model_loess, new_fitness_df)
  max(fitness_pred)
  fitness_pred[fitness_pred == max(fitness_pred)]
  max_fit_index <- match(max(fitness_pred), fitness_pred)
  max_fit_coord <- new_fitness_df$new_gene_coordinates[max_fit_index]
  
  # refine the research near the found max
  window_max <- seq((max_fit_coord - 100000), (max_fit_coord + 100000), 500)
  window_max_df <- data.frame( 'new_gene_coordinates' = window_max)
  # if coordinates does not fall in [0, genome_size] range we shift to ensure
  # it falls within the genome interval
  window_max_df$new_gene_coordinates <- ifelse(
    window_max_df$new_gene_coordinates < 0,
    window_max_df$new_gene_coordinates + genome_size,
    window_max_df$new_gene_coordinates)
  window_max_df$new_gene_coordinates <- ifelse(
    window_max_df$new_gene_coordinates > genome_size,
    window_max_df$new_gene_coordinates - genome_size,
    window_max_df$new_gene_coordinates)

  fitness_pred_max <- predict(model_loess, window_max_df)

  fitness_pred_max[fitness_pred_max == max(fitness_pred_max)]
  ori_index <- match(max(fitness_pred_max), fitness_pred_max)
  ori_coords <- window_max_df[ori_index,]

  oris <- c(oris, ori_coords)
}
# The values found using different span values are all near the real ORI, so
# this method works.
genome_size - random_start
[1] 1697934
[1] 1587424 1665184 1675415 1853698

As we can see, the values of possible ORIs are close enough to the real ORI, indicating that the span value do not impact too much on the results. This can be because this genome have only one ORI and so the fitted curve is slightly non linear.

Now we choose one ORI e reshift the genome according to this, in order to check if we have th u-shape, that means that we found the correct ORI.

Finding TER

Now we want to find the TER. In order to do this we follow the same steps made before for the ORI, so we fit a LOESS with the selected span value and we find the minimum value of this curve, which is the coordinate that have the less bias and so the genomic coordinate where the replication process end.

      198 
0.9850598 

As we can see from this plot, ORI and TER are correctly placed respectively on the region with higher and lower average fitness value.